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1. Introduction 

Astrophysical applications related to the physics of the early universe, as well as 
challenges posed by the physics programs at new heavy ion accelerators, have triggered a 
renewed interest in the understanding of real time processes in the context of quantum 
field theory. With the advent of new computer technology and the recent success of 
new computational schemes, non-equilibrium phenomena which have been previously 
studied only in the framework mean- field theory §, [|, are now being revisited, and 
more complex next to leading order approaches @, |], [| [7j are being used in an attempt 
to clarify the role played by the rescattering mechanism, which is responsible for driving 
an out of equilibrium system back to equilibrium. Of particular interest is the study of 
the dynamics of phase transitions and particle production following a relativistic heavy- 
ion collision. One way of approaching this study is based on solving Schwinger Dyson 
equations within the closed time path (CTP) formulation |J. This formalism has been 
recently shown to provide good approximations of the real time evolution of the system 
both in quantum mechanics and 1+1 dimensional classical field theory ||, where direct 
comparisons with exact calculations can be performed. 

The key element in carrying out such studies is related to the calculation of 
the two-point Green function, which is solved for self-consistently with the equations 
of motion for the fields. The two-point Green function gives rise to Volterra-type 
integral or integro-differential equations. In the process of extending our study to 
encompass a higher number of spatial dimensions, i.e. 2+1 and 3+1 field theory, we 
are faced with the challenge of coping with constraints dictated both by storage and 
time-related computational limits. Thus our interest in designing algorithms which 
feature spectral convergence in order to achieve convergence with minimum storage 
requirements. In addition, we also desire these algorithms to scale when ported to 
massively multiprocessor (MPP) machines, so that solutions can be obtained in a 
reasonable amount of time. 

Algorithms for Volterra integral and integro-differential equations usually start 
out at the lower end of the domain, a, and march out from x = a, building up the 
solution as they go fllP] . Such methods are serial by nature, and are, in general, not 



suitable for parallel implementation on a MPP machine. Even so, clever approaches 
to already existing methods can provide algorithms that take advantage of a parallel 
processing computer: Shaw [|ll|] has shown recently that once the starting values of 
the approximation are obtained, one can design a global approach where successive 
approximations of the solution over the entire domain x G [a, b] can be evaluated 
simultaneously. 

In a recent paper [O one of us has discussed a spectral method [[TTJ of solving 
some types of equations of interest for the study of time-dependent nonequilibrium 
problems in quantum field theory. The gist of the method consists in expanding out the 
unknown function in terms of Chebyshev polynomials on a suitable grid, thus reducing 
the problem to finding the numerical solution of a system of linear equations. The main 
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advantage of this method over standard finite-difference type methods resides in the 
spectral character of its convergence. This is related in part to the fact that Chebyshev 
type methods use a non-uniform grid, while finite-difference methods require a uniform 
grid. Usually there is a trade-off between computational time and storage requirements, 
and a balanced solution must be reached on a case-by-case basis. Spectral methods 
are more expensive per point as the matrices may be considerably denser than in the 
finite-difference case, but we require considerably fewer grid points in order to achieve 
the same degree of accuracy. By expanding the unknown function on a compact support 
in Chebyshev polynomials and using a partition of the domain based either on the set 
of (N+l) extrema or the set of N zeros of Tjj(x) - the Chebyshev polynomial of first 
kind of degree N - we in fact replace a continuous problem by a discrete one. For non- 
singular functions the discrete orthogonality and completeness relations for Chebyshev 
polynomials at the above grid points assure a defacto exact expansion for an arbitrary 
finite value N. In practice however, one has to compute derivatives and integrals of 
the unknown function at the collocation points, and the Chebyshev expansion provides 
only an approximation for these subsequent computations. These errors, together with 
the finite accuracy of numerical methods needed in conjunction with the Chebyshev 
expansion, conspire in order to deteriorate the accuracy of the solution at very small 
values of N. 

The paper is organized as follows: In Section for comparison purposes, we start by 
reviewing a finite-difference approach for the numerical solution of Volterra type integro- 
differential equations. We review the general framework of the Chebyshev-expansion 
method in Section |3], and illustrate our approach for the case of Volterra integro- 
differential equations. In Section ^ we present a complete assessment of the convergence 
and computational cost of the proposed method for the case of a test problem, and 
compare with results obtained via the finite-difference method. In Section ^ we discuss 
the relevant aspects of a large-scale calculation arising in the study of time-dependent 
quantum field theory, for which our numerical strategy is particularly suitable. We 
present our conclusions in Section |(| 

2. Stable multi-step method for Volterra type equations 

The type of problems arising in the study of time- dependent nonequilibrium quantum 
field theory via a Schwinger-Dyson equation approach, can be formally reduced to the 
general case of a nonlinear Volterra integro-differential equation. Direct methods for 
solving nonlinear Volterra integral and integro-differential equations are inherently serial 
and therefore have not received much attention for use on a parallel computer. It is worth 
mentioning here the work of Crisci et al ||14|| , who concentrated on the stability aspects of 
parallel iteration of Volterra-Runge-Kutta (VRK) methods for solving Volterra integral 
equations on parallel computers. VRK methods are step-by-step methods and can take 
advantage of parallel architecture. Sommeijer et al covered the stability of parallel 
block methods for ordinary differential equations (ODE) and included equations of the 
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integro-differential type in their discussion. 

We summarize here a recent parallel algorithm which concentrates on 

modifying the algorithmic side of the numerical solution process for use on a parallel 
processor while consciously utilizing methods that are known to be stable. The 
algorithm is in effect an example of a higher-order finite-difference approach, and we 
use this approach to compare with the spectral method presented later in this paper. 

For illustration, let us consider a first-order nonlinear Volterra integro-differential 
equation of the form 

y '(x) = F[x,y,Z[x;y]], x e [a, b] , (1) 

with 



Z[x;y}= / K[x,t;y(t)]dt , (2) 

J a 

and subject to the initial condition 

y(o) = Vo ■ (3) 

Let In be a partition of I=[a,b], where In = {xn = a + nh, n = 0(1) N, Nh = (b — a)}. 
The problem is to find approximations y n to the solution y(x n ) of Eqs. (|l|-|3|) for each 
x n G In- A fc-step method for an integro-differential equation of the form ([J) is given 
by 



y n +i = y n + hJ2 w j F ( x n-j,y n -j,Zn-j) , n = k(l)N, (4) 

3=0 



where 



n-j 



Z, 



n-j 



h c n -j,iK(x n -j, Xi, Ui) , j = 0(1) k , ?/o = y(a) . (5) 

i=0 

The weights Wi depend on the fc-step method selected and the weights Qj are those of a 
standard quadrature formula for integrating a function whose value is known at equally 
spaced steps, such as a Newton-Cotes or Newton- Gregory quadrature rule. For our 
multi-step (fc = 4) method [|l(| we choose the fourth order Adams-Bashforth predictor 

h r 

Vk+i =yk + ^ I 55 F ( x k, Vk, z k ) - 59 F(x fc _i, y fc _ b z k ^) (6) 
+ 37F(x fc _ 2 , yk-2, Zk-2) - 9 F(x k - 3 , y fc _ 3 , z k ~z) 

and the Adams-Moulton corrector 

h r 

Vk+i = yk + [ 9 F ( x k+i, y° k+ i, Zk+i) + 19 F(x k , y k , z k ) (7) 
- 5 F(x fc _i, y fe _i, z k -i) + F(x k ^ 2 , Vk-2, Zk-2) 

while the integral term (0) is calculated based on the Newton-Gregory quadrature 
formula. We use a fourth order Runge-Kutta method in order to start out the 
calculation. 

In order to make the algorithm suitable for parallel processing, it is useful to recall 
that a standard quadrature method based on an uniform grid for the integral term 
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requires knowledge of the integrand function at the abscissas in the interval [x ,xj. 
This is obviously a serial process and not a good candidate for parallelization. It can 
be observed however, that once the starting values are obtained, all approximations Zi 
with % = 0(1) k — 1 can simultaneously be evaluated up to and including x^-i- After 
that, once a value of i/j corresponding to a new step Xj is established via the predictor- 
corrector method, all values Zi with % = A can also be evaluated simultaneously. 
This observation makes the following algorithm possible: 

(i) Find the starting values (y^, zi) with i = 0(1) k — 1 

(ii) doi = k,N 

add contributions to corresponding to (xj,yj), where j = 0(1) k — 1 

(iii) do i = k, A 

(a) predict yi 

(b) estimate Z; L from (xi,yi) 

(c) correct i/i 

(d) do j = i,N 

update Zj by adding the contribution corresponding to (xi, yi) 

The above numerical algorithm is implemented using the OpenMP style directives 
for the Portland Group's pgf77 FORTRAN compiler, and reportedly shows good 
scalability on a shared-memory multiprocessor. The speedup of the finite difference 
method is best for a large number of grid points which, correspondingly, gives a better 
solution approximation. For example, with N=5120 and 4 processors the speedup is 
3. 86, a good measure of processor utilization. 

While the preceding algorithm performs well on a shared memory platform, it 
does not port easily to an MPP machine. Before we comment on the efficiency of 
the algorithm, let us make two general comments: Firstly, we denote by T cg ^ c and 
Tcomm the time required to perform a floating-point operation and the time required 
to send a floating-point number, respectively. Secondly, we will ignore for simplicity the 
effect of message sizes on communication costs, and assume throughout that the ratio 
Tcomm/^caic is independent of A. 

Returning now, to our proposed algorithm, we remark that the communication 
cost for the corresponding implementation involves only the integral terms. Even so, 
using the message-passing interface (MPI) protocol the communication cost is 4 log A 
for the starting values and up to A 2 for the remainder of the algorithm which gives 
a total of (A 2 + 41ogA)T C omm- The total number of flops depends on the specific 
application but a reasonable measure is the number of function evaluations which is 
given by (A 2 + 4A)T ca j c . The ratio of communication to computation 

A 2 + 41ogAT comm 
N 2 + *N T calc 

approaches a constant value as A gets larger. The communication overhead problem 
can be relaxed by employing a spectral method discussed in the following section, the 
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improvement being especially significant for a multidimensional problem of the type 
required by our nonequilibrium quantum field theory calculations |J. 

3. Spectral method with Chebyshev polynomials 

Consider the N + 1 extrema of the Chebyshev polynomial of the first kind of degree N, 
Tn(x). This set defines a non-uniform grid in the interval [—1, 1], as 

x k = cosf^J , k = 0(l)N. (8) 

On this grid, the Chebyshev polynomials of degree i < n obey discrete orthogonality 
relations 

N 

E" T^XkjTj^k) = PiO~ij, (9) 

k=0 

where the constants fa are 

r n 

ft = 2' '*<>•»• 
[ N, % = 0,N. 

Here, the summation symbol with double primes denotes a sum with both the first 
and last terms halved. We approximate an arbitrary continuous function of bounded 
variation f(x) in the interval [—1,1], as 

N 

/(*)« E'^iW, (io) 

3=0 

with 

2 N 

^■ = ^E"/W fc ), j = o(i) at. (ii) 

fc=0 

Eq. (|T0|) is exact at x equal to Xk given by Eq. @. Based on Eq. (|10|) , we can also 
approximate derivatives and integrals as 

N n JV 

/'(*) « E" /(**) E" W • ( 12 ) 

fc=o iV i=o 

and 

r x N o N r x 

/ /(t)dt^'7fe) ¥ E"^) / T,(t)dt. (13) 
fc=o iV i=o 

In matrix format, we have 



/(t)dtj «S [/] , (14) 

[/'(*)] « ^ [/] , (15) 

The elements of the column matrix [/] are given by f{xk), k = 0(1) A?". The right- 
hand side of Eqs. ([14]) and (|i~5[) give the values of the integral /f x /(£) d£ and the 
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derivative f'(x) at the corresponding grid points, respectively. The actual values of the 
elements of the matrices S and D can be derived using Eqs. ( |T2| , [X3|) . 

In order to illustrate the Chebyshev algorithm, we consider again the case of a 
first-order nonlinear Volterra integro-differential equation of the form 

F[x,y,Z[x;y]) , xe[a,b], 



y [X) 
Z[x;y 

with the initial condition 



K[x,t;y{t)]dt 



y w = yo ■ 

Here we make no explicit restrictions on the actual form of the function F[x, y, Z[x; y]], 
so both linear and nonlinear equations are included. We determine the unknown function 
y(x) using a perturbation approach: We start with an initial guess of the solution yo(x) 
that satisfies the initial condition yo(a) = yo, and write 

y(x) = yo(x) + e(x) , 

with e(x) being a variation obeying the initial condition 

e(a) = . (16) 

Hence, the original problem reduces to finding the perturbation e(x), and improving the 
initial guess in a iterative fashion. 

We use the Taylor expansion of F[x, y, Z[x; y]] about y(x) = yo(x) and keep only 
the linear terms in e(x) to obtain an equation for the variation e(x) 

dF[x,y,Z[x;y}} 



e (x - 



dy(x) 
dF[x,y,Z[x;y}} 



e(x) 



y(x)=yo(x) 



dK[x,t;y(t)\ 



dZ[x;y] 

= -y'o( x ) + F i x , yo(af), Z[x; yo(x)]] . 
Equation (|I7D is of the general form (|18D 
e'(x) = q[x, e(x)] + r(x) , 



dy(x) 



e(t)dt 



y(x)=yo{x) 



(17) 
(18) 



where 



q[x,e(x)] 



dF[x,y,Z[x;y}} 



+ 



dy(x) 
dF[x,y,Z[x;y] 



dZ[x-y] 



e{x) 

y(x)=yo(x) 

- dK[x,t;y{t)} 



y(x)=yo(x) 



dy(x) 



e(t)dt 



y(x)=yo(x) 



and 



r(x) = - y' (x) - F[x, y (x), Z[x; y]] , 

together with the initial condition given by flUt ). We replace Eqs. and (|l^) by an 
integral equation, obtained by integrating Eq. (|18D and using the initial condition ([jj]) 
to choose the lower bound of the integral. We obtain 



e{x) 



q[t,e(t)] dt+ / r{t)dt 



(19) 
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which is in fact a linear Volterra integral equation of the second kind. Using 
the techniques developed in the previous section to calculate integrals, the integral 
equation (|19D can be transformed into a linear system of equations. A practical 
implementation of this algorithm is illustrated via a test problem in the following section. 



4. Test problem 



Following Shaw [[LI]] , we consider the test problem 

1 



i-yW _ 



y{x) = xe 
y(0) = y = 1 , 

which has the exact solution 

y(z) 



;i + x) 2 

x e [0, l] 



X 



X 



o (i + ty 



; 1_y Wdt , 



(20) 
(21) 

(22) 

1 + x K ' 

We shall use the initial guess yo(x) = l/o cos ( a; ); so that yo(0) = y§. The equation for 
the variation e(x) is 

rx r s o(=i-yo(t) 

te 1 - yo(t) e(t)dt + ds -j- w e(t)dt = (23) 

Jo Jo ( 1 1 ' ' 



ex - 



yo(x) +y + 



te i-yoW _ 



dt- 



ds 



se 



i-yo(t) 



o (1 + t) 2 



dt. 



(1 + i) 2 

In matrix format and using the Chebyshev expansion presented above, the variation 
e(x) will be obtained as the solution of linear system of equations 

A [e] 



C 



with matrices A and C given as 



(24) 



A 



'j 



Oj j S, 



te i-yo(*) 



+ Sik x kSkj 



,i-yo(*) 



Ci = - [yo{t)]i + yo + S ik 



te i-y (t) _ 



i,j = 0(1) N 
1 



- 1 



Sik x kS, 



kl 



e i-yo(*) 



From a computational point of view the computer time is spent initializing the 
matrix elements A^ and Cj on one hand, and finding the solution of on the other. 
On the first matter, the calculation decouples nicely, and once we have the vector [y ], 
we can calculate {Ci,Aij,j = 0(1) N} in parallel for i = 0(1) N. The algorithm is as 
follows: 

(i) calculate [y ] = [y ] + [e] ; 

(ii) broadcast [y ] ; 
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(iii) do i = 0, N : 

(a) master to slave: send i ; 

(b) slave: compute {Ci,Aij,j = 0(1) N} ; 

(c) slave to master: return {Cj, Aij,j = 0(1) N}. 

Regarding the second step, i.e. solving the linear system of equations, the best 
choice is to use the machine specific subroutines, which generally outperform hand- 
coded solutions. When such subroutines are not available, as in the case of a Linux 
based PC cluster for instance, one can use one of the MPI implementations available 
on the market. We shall see that the efficiency of the equation solver is critical to 
the success of the parallel implementation of the Chebyshev-expansion approach. In 
order to illustrate this aspect we perform two calculations, first using a LU factorization 
algorithm, and secondly using an iterative biconjugate gradient algorithm. These are 
standard algorithms [10| for solving systems of linear equations, but their impact on the 
general efficiency of the approach is quite different. 



4-1- Serial case 

Figure [l] depicts the average CPU time required to complete the calculation for the 
various methods. Figure |2] illustrates the convergence of the two numerical methods. 
The spectral character of the method based on Chebyshev polynomials allows for an 
excellent representation of the solution for iV > 12. We base our findings on a a < 10~ 10 
criteria, where a denotes the sum of all absolute departures of the calculated values from 
the exact ones, at the grid points. 

The number of iterations required to achieve the desired accuracy in the Chebyshev 
case is depicted in Fig. |3|. The number of iterations becomes flat for N > 12, and 
stays constant (17 iterations) even for very large values of N. The higher number of 
iterations corresponding to the lower values of N, represents an indication of a insufficient 
number of Chebyshev grid points: the exact solution cannot be accurately represented 
as polynomial of degree N for x £ [0, 1]. It is interesting to note that for N = 12 — 16, 
a reasonable lower domain for the representation of the solution using Chebyshev 
polynomials, the reported CPU time is so small that for our test problem there is 
no real justification for porting the algorithm to a MPP machine. This situation will 
change for multi-dimensional problems such as those encountered in our nonequilibrium 
quantum field theory studies. 



4-2. Parallel case 

The LU factorization algorithm is an algorithm of order N 3 and consequently, most of the 
CPU time is spent solving the linear system of equations (see Fig. f|). As a consequence, 
a parallel implementation of the LU algorithm is very difficult. Figure |5| shows how the 
average CPU time changes with the available number of processors. Here we use a very 
simple MPI implementation of the LU algorithm as presented in reference [f[6| . Even 
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though we could certainly achieve better performance by employing a sophisticated LU 
equation solver, the results are typical. Since the actual size of the matrices involved 
is small, the communication overhead is overwhelming and the execution time does not 
scale with the number of processors. 

Fortunately, even for dense matrices and small values of the number of grid 
points N, one can achieve a good parallel efficiency By employing an iterative method 
such as the iterative biconjugate gradient method, one can render the time required 
to solve the system of linear equations negligible compared with the time required to 
initialize the relevant matrices, which in turn is only slightly more expensive than the 
initialization process of the LU factorization algorithm. The initialization process can be 
parallelized using the algorithm presented above and the results are depicted in Fig. ||. 

It appears that by using the biconjugate gradient method the efficiency of the 
parallel code has improved considerably. However, the average CPU time saturates to 
give an overall speedup of 3.5 . This can be understood by analyzing the computation 
and communication requirements for our particular problem. The calculation cost 
to initialize the matrices A and C is roughly given by the number of floating-point 
multiplications and additions (7N 2 + 3iV)T ca j c , while the communication cost is given 
by (iV 2 + 2iV)T comm . Therefore, the ratio of communication to computation is 

N 2 + 2N T CO mm 
7N 2 + 3N T calc ' 

As in the finite-difference case, this ratio approaches a constant value as N gets larger 
and it becomes apparent that the communication overhead is still a problem. 

However, multi-dimensional applications such as those presented in require 
complicated matrix element calculation. In such cases, the process of initializing the 
matrices A and C is quite involved, and the ratio of the communication time relative 
to the computation time becomes favorable. In addition, the matrix A becomes sparse 
and the size of the linear system of equations is substantially larger, thus one can also 
take advantage of existing parallel implementation of the iterative biconjugate gradient 
algorithm |T7|]. Such problems benefit heavily from an adequate parallelization of the 
code. We will discuss such an example in the following section. 



5. Volterra-like integral equations for a two-point Green function 

Schwinger, Bakshi, Mahanthappa, and Keldysh || have established how to formulate an 
initial value problem in quantum field theory. The formalism is based on a generating 
functional, and the evolution of the density matrix requires both a forward evolution 
from zero to t and a backward one from t to zero. This involves [|18J both positive and 
negative time ordered operators in the evolution of the observable operators and the 
introduction of two currents into the path integral for the generating functional. Time 
integrals are then replaced by integrals along the closed time path (CTP) in the complex 
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P /"OO /"OO 

F(t)dt = / F+(t)dt- F(t)dt. (25) 

JC J0:C+ JO-.C- 

Using the CTP contour, the full closed time path Green function for the two point 
functions is: 

g(t,t') = 0>(t,f)e c (t,o + g<{t,t')Q c {t',t) , 

in terms of the Wightman functions, G >j< (t,t'), where the CTP step function Qc(t,t') 
is defined by: 

Q(t, t') for t on C + and t! on C + , 
Qu t >\ < for t on C + and t' on C_ , 

' II for £ on C_ and £' on C + , 

0(t', t) for t on C_ and t' on C_ . 

For complete details of this formalism and various applications, we refer the reader to 
the original literature || [181 , and we confine ourselves to discussing how our Chebyshev- 
expansion approach is applied to the computation of the two-point Green function. 

For simplicity we consider now the quantum mechanical limit of quantum field 
theory (0+1 dimensions). In this limit, we are generally faced with the problem of 
numerically finding the solution of equation 

g(t,t') = G(t,t') - J c dt"Q(t,t")g(t",t'), (27) 

Here, the Green functions, g(t,t') and G(t,t'), are symmetric in the sense that 
g > (t,t') = g < (t',t), and obey the additional condition 

£>,<(*, = -£<,>(M') = Q<,>(t',t). (28) 

The function Q(t,t') obeys less stringent symmetries 

<?>,<(*, = -Q<,>(M') Q<,>(t',t), (29) 

which is always the case when Q(t,t') has the form 

Q(t,t r ) = J c dt"A(t,t")B(t",t'), (30) 

where A(t,t') and B(t,tf) satisfy (p8|). 
We can further write Eq. (p8|) as 

Height')} = - fte{S<(M')}, (31) 

Jm{£> (t , } = Jm{£< (t, t') } , (32) 

or 

(t, o - g* (t , t') = 2 7ee{g> (t , } , (33) 

g> (t, + 6?* (t, = 2 Zm{£> (t , t') } . (34) 
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Hence, a Green function Q{t,t r ) is fully determined by the component G>{t,t') = 
He{G>(t,t f )} +i Im{G>{t,t')}, with t' < t. Thus, in order to obtain the solution 
of Eq. (|27j) , we only need to solve 

G>(t,t') = G > {t : t')-2 f ' dfKeiQ^t")} G>(t",t') 

Jo 

+ 2 f dfQ>(t,f) Ke{G>(t",t')} . (35) 
Jo 



We separate the real and the imaginary part of ( p5|) and obtain the system of integral 
equations 

Height')} =Ke{G > {t,t')}-2 / < dt"^e{g > (t,t")}^e{^ > (t",t')} 

Jo 

+ 2 f dt"^e{Q > (t,t")}^e{^ > (t",t')} (36) 
Jo 

Im{g>(M')} = Jm{G > (t,t / )} -2 / di"fte{Q > (M")}Xm{£ > (t",t / )} 

Jo 

+ 2 /"* dt" Jm{Q> (t, t") }TZe{G> (t\ t') } . (37) 
Jo 

The above system of equations must be solved for t! < t. The two equations are 
independent, which allows us to solve first for the real part of Q > (t,t'), and then use 
this result to derive the imaginary part of Q > (t,t'). 

Despite their somewhat unusual form, the above equations are two-dimensional 
Volterra-like integral equations and our general discussion regarding the Chebyshev 
spectral method applies. We will perform a multi-step implementation of the formalism. 
Let 

U = U (N-l)+h > I < H < N , 

be the grid location corresponding to the collocation point i x of the interval labelled i +l- 
Then, the discrete correspondent of Eq. (^) is 

0>(ti,tj) = G>(t j ,t i ) (38) 

=ko(N— l)+fei] 

k =0 ki=l 
N 

- ^[25 nfcl ]7ee{Q > (t i ,t fc 

[=j (A r -l)+fci])} G>{tk,tj) 

fei=l 
J'o-l AT 

+ X! ^ [2'5'jvfci]Q>(^, ^fc[=fe (iv-i)+fei])^e{^>fa) ^7)} 

fc =0 fei=l 
iV 

+ ^[25, lfcl ]Q > (t J ,t fc 



fcl=l 



with tj < ti 



We will refer now to Figs. [8] and |9|. Equation fl39|) involves values of G>(tk,tj), for 
which ^ > tfc. In such cases, we use the symmetry G>{tj,tk), which relates to the values 
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the two-point function located in the domain of interest. For the time interval (i + 1) 
the size of the linear system of equations we need to solve is 

l(i + l)(N - l)[(i + 1)(N - 1) + 1] - U (N - l)[i (N - 1) + 1] 

= io (jV_l)2 + ijV(#-l), 

or of order (i + h)(N — l) 2 . In practice, the value of N is taken between 16 and 32. 

Tables |l] and |2| summarize the number of floating-point operations performed 
in order to compute the non-vanishing matrix elements corresponding to a given i 
and j, (j < i). 

We can now calculate the ratio of communication to computation time, by noticing 
that the numbers in the tables above get multiplied by N, corresponding to the number 
of collocation points in each time step and summing over the number of steps, i.e. we 
evaluate 

io r i 

+ NJ2 3 <io(N-l) . 
io=i 

In Table |3| we summarize all relevant estimates regarding the computation cost for 
a fixed value of i. In order to estimate the total communication and computation 
cost, respectively, these numbers must be multiplied by an additional factor of N, 
corresponding to the number of possible values of i in a time step. This factor is 
not relevant for estimating the communication overhead, but it must be remembered 
when one infers the sparsity of the corresponding system of equations. 

To conclude we observe that the communication to computation ratio approaches 
1 ^comrn 
2(*o + l) T calc 

for large values of %q. Therefore for this problem the communication overhead is reduced 
substantially in the later stages of the calculation. In practice, this ratio is actually 
much better, as we compute the functions G(t,t') and Q(t,t') on the fly, and this adds 
considerably to the computational effort. Finally the sparsity of the resulting systems 
of equations goes to 2/(i N) for large values of iq and N, which supports our choice for 
an iterative equation solver. 

6. Conclusions 

We have presented a numerical method suitable for solving non-linear integral and 
integro-differential equations on a massively multiprocessor machine. Our approach 
is essentially a standard perturbative approach, where one calculates corrections to 
an initial guess of the solution. The initial guess is designed to satisfy the boundary 
conditions, and corrections are expanded out in a complete basis of N Chebyshev 
polynomials on the grid of (N+l) extrema of Tn(x), the Chebyshev polynomial of 
first kind of degree N. The spectral character of the convergence of the Chebyshev- 
expansion approach is the key element in keeping low the number of grid points. From 



N 



iij>i (N-l) 
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a computational point of view, each iteration involves two stages, namely initializing 
the relevant matrices and solving the linear system of equations. Both stages can be 
rendered parallel in a suitable manner, and the efficiency of the code increases when 
applied to complicated multi-step, multi-dimensional problems. 

The algorithm discussed in this paper represents the backbone of current investi- 
gations of the equilibrium and nonequilibrium properties of various phenomenological 
Lagrangeians. In particular we are interested in studying the properties of the chiral 
phase transition at finite density for a 2+1 dimensional four-fermion interaction as well 
as the dynamics od 2-dimensional QCD, with the ultimate goal of indirectly obtain- 
ing insights regarding the time evolution of a quark-gluon plasma produced following a 
relativistic heavy-ion collision. 
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Table 1. Summary regarding the calculation of 1ZeQ(ti,tj) at step io + 1 



integral 


domain 




non-zero elements 


additions 


multiplications 


Jo 1 


J < *o(^ - 


1) 


N 


i N 


(2*o + l)N 


Io' dt fc 


3<i (N- 


1) 





(jo + l)N 


(2j + l)N 


total 






N+l 


(i +jo + l)N+l 


2(i +j + l)N 


J? dt fc 


J > *o(^ - 


1) 


(i + l)(JV-l) + l 


io 


(io + 1)N 




J > to(N - 


1) 


(i + l)(iV-l) + l 


io 


(to + l)N 



total (j + i)(jV-l) + 2 (i + l)(N + l) 2(i + l)N 



Table 2. Summary regarding the calculation of ImQ(ti, tj) at step io + 1 



integral 


domain 




non-zero elements 


additions 


multiplications 


Io U dt k 


j < io(N - 


1) 


N 


toN 


(2i + 1)N 


Io 3 dt k 


3<io(N- 


1) 





(3o + l)N 


(23o + 1)N 


total 






N+l 


(io+3o + l)N + l 


2(i +j + l)N 


Io U dh 


3 > to(N - 


1) 


(i + l)(iV- 1) + 1 


io 


(to + l)N 


Io' dt k 


3 > io(N- 


1) 





(io + l)N 


(2i + 1)N 


total 






(i + l)(N-l) + 2 


(i + l)(N + l) 


(3i + 2)N 



Table 3. Global communication and computation data regarding the calculation of 
Q(ti, tj) at step i + 1 

equation floating-point numbers to be sent floating-point operations 

neQ(t,t') (3i + 1.5)iV 2 - (3i + 0.5)N (5.5i + 2)(i + 1)N 2 + Ni 

lmg(t,t') (3i + l-5)N 2 - (3i + 0.5)N [(5.5i + 2)(i + 1) + i ]N 2 + Ni 
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10000 



Figure 1. Average CPU time versus the number of grid points for the Chebyshev 
expansion approach using the either LU decomposition (squares) or the biconjugate 
gradient method (crosses), and finite-difference approach (circles). 
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Figure 2. Convergence of the Chebyshev result (filled) compared with the finite- 
difference result (empty), versus the number of grid points. 
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Figure 3. Number of iterations versus the number of grid points for the Chebyshev 
method. 
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Figure 4. Total CPU time (filled) and CPU time spent carrying out the LU 
decomposition (empty), versus the number of grid points (1 CPU case). 
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Figure 5. Scaling of the average CPU time with the number of available processors 
for the Chebyshev expansion approach and the LU factorization algorithm (N=500). 
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Figure 6. Scaling of the average CPU time with the number of available processors for 
the Chebyshev expansion approach and the biconjugate gradient algorithm (N=500). 
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Figure 7. Complex time contour C for the closed time path integrals. 




